47 research outputs found

    A quick search method for audio signals based on a piecewise linear representation of feature trajectories

    Full text link
    This paper presents a new method for a quick similarity-based search through long unlabeled audio streams to detect and locate audio clips provided by users. The method involves feature-dimension reduction based on a piecewise linear representation of a sequential feature trajectory extracted from a long audio stream. Two techniques enable us to obtain a piecewise linear representation: the dynamic segmentation of feature trajectories and the segment-based Karhunen-L\'{o}eve (KL) transform. The proposed search method guarantees the same search results as the search method without the proposed feature-dimension reduction method in principle. Experiment results indicate significant improvements in search speed. For example the proposed method reduced the total search time to approximately 1/12 that of previous methods and detected queries in approximately 0.3 seconds from a 200-hour audio database.Comment: 20 pages, to appear in IEEE Transactions on Audio, Speech and Language Processin

    Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

    Full text link
    Self-supervised learning general-purpose audio representations have demonstrated high performance in a variety of tasks. Although they can be optimized for application by fine-tuning, even higher performance can be expected if they can be specialized to pre-train for an application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, to learn from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field. Our code is available at: https://github.com/nttcslab/m2d/tree/master/speechComment: Interspeech 2023; 5 pages, 2 figures, 6 tables, Code: https://github.com/nttcslab/m2d/tree/master/speec

    INTER-TRIAL DIFFERENCE ANALYSIS THROUGH APPEARANCE-BASED MOTION TRACKING

    Get PDF
    The purpose of this study is to develop a method for quantitative evaluation and visualization of inter-trial differences in the motion of athletes. Previous methods for kinematic analyses of human movement have required attaching specific equipment to a body segment or can only be used in an environment designed for analyses. Therefore, they are difficult to use for observing motions in real games. To enhance the applicability to real-game situations, we propose appearance-based motion tracking. Our method only requires an image sequence from a camera. From the image sequence, automatic detection of trials and a difference analysis of them are conducted. We applied our method to the analysis of pitching motions in actual baseball games. Though we have no quantitative evaluations yet, the experimental results imply the efficacy of our method

    Deep Attentive Time Warping

    Full text link
    Similarity measures for time series are important problems for time series classification. To handle the nonlinear time distortions, Dynamic Time Warping (DTW) has been widely used. However, DTW is not learnable and suffers from a trade-off between robustness against time distortion and discriminative power. In this paper, we propose a neural network model for task-adaptive time warping. Specifically, we use the attention model, called the bipartite attention model, to develop an explicit time warping mechanism with greater distortion invariance. Unlike other learnable models using DTW for warping, our model predicts all local correspondences between two time series and is trained based on metric learning, which enables it to learn the optimal data-dependent warping for the target task. We also propose to induce pre-training of our model by DTW to improve the discriminative power. Extensive experiments demonstrate the superior effectiveness of our model over DTW and its state-of-the-art performance in online signature verification.Comment: Accepted at Pattern Recognitio
    corecore